Abstract

Background: As a result of the growing body of protein phosphorylation site data, the number of 
phosphoprotein databases is constantly increasing, and dozens of tools are available for predicting protein 
phosphorylation sites to achieve fast automatic results. However, none of the existing tools has been developed to 
predict protein phosphorylation sites in rice. 

Results: In this paper, the phosphorylation site predictors NetPhos 2.0, NetPhosK, KinasePhos, Scansite, DISPHOS 
and PredPhospho were integrated to construct meta-predictors of rice-specific phosphorylation sites using several 
methods, including unweighted voting, unreduced weighted voting, reduced weighted voting and weighted 
voting strategies. PhosphoRice, the meta-predictor produced by the weighted voting strategy with parameters 
selected by restricted grid search and conditional random search, performed the best at predicting 
phosphorylation sites in rice. Its Matthews correlation coefficient (MCC) and accuracy (ACC) reached 0.474 and 
73.8%, respectively. Compared to the best individual element predictor (Disphos_default), PhosphoRice achieved a 
significant increase in MCC of 0.071 (P < 0.01) and an increase in ACC of 4.6%. 

Conclusions: PhosphoRice is a powerful tool for predicting unidentified phosphorylation sites in rice. Compared to 
the existing methods, we found that our tool showed greater robustness in ACC and MCC. PhosphoRice is 
available to the public at http://bioinformatics.fafu.edu.cn/PhosphoRice. 



Background 

Protein phosphorylation is the most common form of 
protein post-translational modification (PTM) [1-3]. 
Phosphorylation and dephosphorylation of proteins constitute a 
universal mechanism for regulating protein function in 
the eukaryote, prokaryote and archaea kingdoms. Given 
the importance of protein phosphorylation in regulating 
cellular signaling, large-scale identification of phos- 
phorylated proteins has been carried out in yeast [4], 
mice [5], humans [6], Arabidopsis [7,8], rice [9-12] and 
Medicago [13]. As the data grow, the number and the 
size of the available phosphoprotein databases are 
increasing and are becoming more complex. The Phos- 
pho.ELM database contains validated phosphorylation 
sites that are mostly derived from mammals [14], 
Phosida contains large-scale data from Homo sapiens and 
Bacillus subtilis [15], PhosphoSite (http://www.phospho- 
site.org/) is a curated site that focuses on vertebrate sys- 
tems [16] and PhosPhAt is a phosphorylation site 
database that is specific for Arabidopsis [17]. 

The growing body of protein phosphorylation site data has 
stimulated the development of computational approaches 
to predict these sites from protein sequences. Over the 
past decade, a series of algorithms have been developed to 
predict phosphorylation sites from amino acid sequences 
[18]. A few well-maintained web sites that offer prediction 
of protein phosphorylation sites have been made freely 
available to the scientific community, including NetPhos 
[19], NetPhosK [20], KinasePhos [21], KinasePhos 2.0 [22], 
DISPHOS [23], Scansite [24], PPSP [25], GPS [26], 
PredPhospho [27], NetPhosYeast [28], GANNPhos [29] and 
Musite [30]. However, the existing protein phosphorylation 
site prediction tools show a data sampling bias. The 
predictors perform at a high accuracy only for individual 
species [17]. Many existing prediction programs were 
primarily derived from mammalian data and exhibit poor 
performance in predicting plant phosphorylation sites. 
Therefore, based on the experimentally validated phos- 
phorylation sites in a specific model organism, organism- 
specific predictors have been developed. NetPhosYeast, a 
yeast-specific predictor, outperforms existing generic pre- 
dictors in the identification of phosphorylation sites in 
yeast [28]. PhosPhAt, which predicts phosphorylated serine 
sites in Arabidopsis, is benchmarked to perform better 
with Arabidopsis sequences than other generic predictors 
[17]. To our knowledge, no existing methods have been 
developed to specifically predict protein phosphorylation 
sites in rice. 

Just as Arabidopsis thaliana (L.) serves as a model for 
dicotyledonous species, rice (Oryza sativa L.) is a 
representative model monocotyledonous (monocot) species. 
Moreover, rice has an immense socio-economic 
impact on human civilization. In the past decade, with 
proteomic technologies and the availability of genome 
sequences, rice proteomic research has been propelled 
to a new level, which is crucial for a better 
understanding of monocot plants [31]. Therefore, rice (Oryza 
sativa L.) also serves as a cornerstone for the study of 
functional genomics in cereal plants [31]. However, current 
predictors perform poorly when individually used to 
predict phosphorylation sites in rice phosphoproteins 
[18]. In our previous work, we constructed three 
different phosphorylation site datasets to test the performance 
of different predictors. We found that the phosphorylation 
site predictors were complementary to some 
extent [18]. Therefore, establishing a meta-server that 
maximizes the complementarity of individual predictors 
might be a promising approach to developing an improved 
prediction system. In this study, we developed a rice-specific 
meta-predictor of protein phosphorylation sites 
by integrating the existing individual predictors. 

Results 

Preprocessing performance assessment of element 
predictors 

All of the protein sequences in the dataset were run 
through all 15 element predictors. Perl scripts were 
developed to submit jobs to the servers with the speci- 
fied prediction options and then to analyze the predic- 
tion performance. As shown in Table 1, the element 
predictors showed different performances in predicting 
rice phosphorylation sites. The element predictor that 
provided the best prediction performance was 
Disphos_default (ACC: 69.2%, MCC: 0.403). 

Unweighted voting, unreduced weighted voting and 
reduced weighted voting strategies 

We combined the element predictors to construct meta- 
predictors using unweighted voting, unreduced weighted 



Table 1 Prediction performance of the element predictors 
on the test dataset 

Element predictor      Sn (%)   Sp (%)   ACC (%)   MCC
KinasePhos2.0_80       81.6     51.2     65.5      0.341

Predicting performance assessed on the dataset of rice phosphorylation sites. 

voting and reduced weighted voting strategies. In 
two-class phosphorylation site prediction problems, a 
score threshold must be set. The threshold score was 
set to half of the sum of the weights of all the element 
predictors to construct the meta-predictors of the 
unweighted voting, unreduced weighted voting and 
reduced weighted voting strategies [32]. In this paper, 
the threshold scores (T) were less than half of the total 
weight of the predictors. 

As shown in Table 2, compared to the best 
element predictor (ACC: 69.2%, MCC: 0.403), the 
meta-predictors constructed by the unweighted voting, 
unreduced weighted voting and reduced weighted voting 
strategies achieved a significant increase in MCC of 
between 0.046 and 0.051. They all had a slight increase 

Table 2 The prediction performance of meta-predictors 
constructed by unweighted voting, unreduced weighted 
voting and reduced weighted voting strategies 

Predictor                                                   ACC (%)   MCC
Best element predictor (Disphos_default)                    69.2      0.403
Unweighted voting                                           72.4      0.449 (1.58E-03)*
Best unreduced weighted voting (with weights set by ACC)    72.5      0.450 (1.18E-03)*
Best unreduced weighted voting (with weights set by MCC)    72.8      0.453 (5.4E-04)*
Best reduced weighted voting (with weights set by ACC)      72.8      0.453 (6.0E-04)*
Best reduced weighted voting (with weights set by MCC)      72.9      0.454 (3.4E-04)*

* P-values in Fisher's Z-transformation test (compared with the MCC of the 
best element predictor) are shown in parentheses. 



Table 1 (continued)

Element predictor      Sn (%)   Sp (%)   ACC (%)   MCC
KinasePhos_default     80.2     57.4     68.1      0.383
KinasePhos_90          77.0     62.3     69.2      0.395
KinasePhos_95          65.8     73.7     70.0      0.396
KinasePhos_100         37.6     89.6     65.1      0.321
Scansite_low           75.9     54.8     64.7      0.313
Scansite_middle        38.1     86.6     63.8      0.285
Scansite_high          12.8     96.5     57.1      0.173
PredPhospho            95.5     13.7     52.2      0.158
DISPHOS_default        80.6     59.1     69.2      0.403
DISPHOS_Arabidopsis    43.9     86.6     66.5      0.341
DISPHOS_Eukaryotes     41.7     87.5     66.0      0.331
NetPhosK_0.5           75.9     46.6     60.4      0.235
NetPhosK_0.7           17.0     87.9     54.5      0.070
NetPhos2.0             70.7     59.9     65.0      0.307

in ACC of between 3.2% and 3.7%. The meta-predictor 
of reduced weighted voting (with weights set by MCC) 
showed the best prediction performance (MCC: 0.454) 
among these meta-predictors. 

Restricted grid search and Conditional random search 

We also ran a weighted voting strategy with parameters 
selected by restricted grid search to construct 
meta-predictors for phosphorylation sites in rice. As shown in 
Table 3, the weighted voting strategy 
with the parameters selected by restricted grid search 
produced a meta-predictor with 
strong prediction performance (ACC: 73.5%, MCC: 
0.469). Compared to the best element predictor, it 
improved MCC by 0.066 and ACC by 4.3%. 

Following the restricted grid search, we developed a 
conditional random search scheme to select the values of 
the 16 parameters. The weight of each 
element predictor was allowed to fluctuate within 
a certain range, bounded by the previous grid value and the 
next grid value of the parameter selected by the restricted grid 
search (Table 3). For instance, the weight of 
NetPhos2.0 was 1 in the restricted grid search, for which the previous 
grid value was 0 and the next grid value was 3. Therefore, in the 
conditional random search, the weight of 
NetPhos2.0 was set to fluctuate between 0 and 3 (Table 
3). Using this strategy, we produced a conditional 
random search meta-predictor, which outperformed 
all the individual predictors 
and the meta-predictors described above (Table 3). Its 
MCC was significantly higher, by 0.071, than that of the 
best individual element predictor (Disphos_default), 
while its ACC was 4.6% higher than that of the best 
element predictor. We named this optimal conditional 
random search meta-predictor PhosphoRice. 

Moreover, we generated receiver operating 
characteristic (ROC) curves from the predicted 
potentials of the meta-predictors. An ROC curve is a plot of the 
true-positive ratio (sensitivity) against the false-positive 
ratio (1 - specificity). The area under an ROC curve 
(AUC) summarizes the trade-off between sensitivity and 
specificity. The ROC curves of 
all the meta-predictors, in comparison to that 
of the best element predictor (Disphos_default), are 
shown in Figure 1. All meta-predictors had larger 
areas under the ROC curve than the best element predictor 
(Table 4). We also calculated the area under 
the ROC curve to compare the prediction performance 
of PhosphoRice with that of Musite. Musite is 
a Java-based standalone application for predicting both 
general and kinase-specific protein phosphorylation 
sites [30]. The MCC of 
PhosphoRice was significantly higher than that of 
Musite (P = 0.044; Table 5). 



Table 3 The parameters in the weighted voting meta-predictors selected by a restricted grid search and a conditional 
random search 

Element predictor      Parameter selected by      Random number*    Parameter selected by
                       restricted grid search                       conditional random search
PredPhospho            0                          Random (1)        0
NetPhos2.0             1                          Random (3)        1.23
NetPhosK_0.5           0                          Random (1)        0
NetPhosK_0.7           0                          Random (1)        0
KinasePhos_default     3                          1+Random (4)      2.75
KinasePhos_90          1                          Random (3)        2.76
KinasePhos_95          0                          Random (1)        0.79
KinasePhos_100         0                          Random (1)        0
DISPHOS_default        3                          1+Random (4)      4.25
DISPHOS_Eukaryotes     1                          Random (3)        1.65
DISPHOS_Arabidopsis    1                          Random (3)        2.22
KinasePhos2.0_80       0                          Random (1)        0.71
Scansite_middle        1                          Random (3)        1.6
Scansite_low           3                          1+Random (4)      3.9
Scansite_high          1                          Random (3)        2.57
T value                8                                            13.3
ACC (%)                73.5                                         73.8
MCC                    0.469 (2.60E-06)**                           0.474 (6.00E-07)**

* Random (3) means the weight could fluctuate from 0 to 3. For instance, in the restricted grid search, the weight value of NetPhos 2.0 was 1, and the previous grid 
value and next grid value were 0 and 3, respectively. In the conditional random search, the weight of NetPhos 2.0 was therefore set as Random (3). The weight value of 
KinasePhos_default was 3, and the previous grid value and next grid value were 1 and 5, respectively. Therefore, its weight was set as "1+Random (4)" in the conditional 
random search. 

** P-values in Fisher's Z-transformation test (compared with the MCC of the best element predictor) are shown in parentheses. 
Figure 1 Receiver operating characteristic curves of the prediction performance of the meta-predictors in comparison to that of the best 
element predictor (Disphos_default). The horizontal axis of each panel is 1 - Specificity. In the diagrams, improved classification performance is indicated for predictors with increased area 
under the ROC curve. The areas under the ROC curves are shown in Table 4. A: ROC curve of unweighted-voting predictor in comparison to 
Disphos_default. B: ROC curve of restricted-grid predictor in comparison to Disphos_default. C: ROC curve of random-voting predictor in 
comparison to Disphos_default. D: ROC curve of unreduced-weight-voting predictor in comparison to Disphos_default (by ACC). E: ROC curve of 
unreduced-weight-voting predictor in comparison to Disphos_default (by MCC). F: ROC curve of reduced-weight-voting predictor in 
comparison to Disphos_default (by ACC). G: ROC curve of reduced-weight-voting predictor in comparison to Disphos_default (by MCC). * By 
ACC: the weights of the meta-predictor were selected to result in the optimal ACC; By MCC: the weights of the meta-predictor were selected to result in 
the optimal MCC. 



Table 4 Areas under the ROC curves for the best element 
predictor, meta-predictors constructed by unweighted 
voting, unreduced weighted voting, reduced weighted 
voting and weighted voting strategies. 

Predictor                                                   Area
Best element predictor (Disphos_default)                    0.758
Unweighted voting                                           0.788
Best unreduced weighted voting (with weights set by ACC)    0.791
Best unreduced weighted voting (with weights set by MCC)    0.792
Best reduced weighted voting (with weights set by ACC)      0.791
Best reduced weighted voting (with weights set by MCC)      0.791
Weighted voting (by restricted grid search)                 0.794
A combination of weight voting and random                   0.796



Discussion 

Prediction performance of element predictors 

Before being integrated into the meta-predictors, the 
existing phosphorylation site predictors used in this 
study were tested and assessed on the rice phosphorylation 
site dataset. All of the element predictors achieved an 
ACC over 50.0%. However, their MCC values differed 
considerably, ranging from 0.070 to 
0.403. Different predictors may perform differently 
in phosphorylation site prediction because of their 
different algorithms and training datasets. The 
results also showed that some kinase family-specific 
predictors could yield good performance under no 


Table 5 The prediction performance of PhosphoRice in 
comparison to that of Musite 

Predictor      ACC (%)   MCC               Area
PhosphoRice    72.4      0.474 (0.044)*    0.796
Musite         73.8      0.446             0.793

* P-value in Fisher's Z-transformation test (compared with the MCC of Musite) 
is shown in parentheses. 
kinase-specific condition, such as KinasePhos_95 (ACC: 
70.0%, MCC: 0.396). 

Prediction performance of unweighted voting, unreduced 
weighted voting and reduced weighted voting meta- 
predictors 

In this paper, the prediction performance of the unweighted 
voting, unreduced weighted voting and reduced 
weighted voting meta-predictors exceeded that of the 
best element predictor (ACC: 69.2%, MCC: 0.403), 
showing a significant increase in MCC (P < 0.01). The 
good performance achieved by these meta-predictors 
was due to the element predictors complementing each 
other. The reduced weighted voting strategy has been 
applied to produce meta-predictors in protein subcellular 
localization prediction [33] and in phosphorylation site 
prediction for a specific kinase family [32], with 
different results. This strategy produced good 
meta-predictors in the protein subcellular localization prediction 
problem [33], but failed to yield meta-predictors 
with the expected performance in the prediction of 
phosphorylation sites for the CK2 kinase family [32]. Wan et 
al. (2008) suggested that the stronger correlation among 
the element predictors might have contributed to this failure. 
However, we argue that the selection of element 
predictors is vital to the prediction performance of 
meta-predictors. The prediction performance of the six element 
predictors used in this study was evaluated in Que et al. 
(2010). We found that the element predictors were 
complementary to some extent. 

Prediction performance of PhosphoRice 

In this study, we applied a more general form of the 
weighted voting strategy. First, we used a restricted grid 
search to determine a range for the parameters. Second, 
we set ranges of the parameters selected by the 
restricted grid search to perform a conditional random 
search. The restricted grid search was very efficient in 
running time performance and in parameter selection. It 
has been widely used to construct meta-predictors, 
including a serine/threonine phosphorylation site predic- 
tor [32] and a protein-protein interaction site predictor 
[34]. Using the restricted grid search, we selected 9 non- 
zero weight parameters for the final meta-predictors 
(Table 3). However, a drawback of using a restricted 
grid search is that it might find a local, rather than a 
global, optimum. Therefore, based on the result of the 
restricted grid search, we ran a finer-grained search 
approach, the conditional random search, to determine the 
16 parameters. The conditional random search produced 
a good meta-predictor, whose rice phosphorylation site 
prediction performance not only exceeded that of the 
best element predictor, but also surpassed that of the 
meta-predictors integrated with unweighted voting, 



unreduced weighted voting and reduced weighted voting 
strategies. We can conclude here that a combined 
restricted grid search and conditional random search 
may be a good approach for determining the parameters 
in weighted voting strategy. 

Conclusion 

To summarize, we created a meta-predictor, PhosphoRice, 
using a weighted voting strategy in which the parameters 
were selected by restricted grid search and 
conditional random search. It shows good performance 
in predicting rice phosphorylation sites, as measured by 
the MCC and ACC. Its MCC was significantly higher, by 0.071, 
than that of the best individual element predictor 
(Disphos_default), while its ACC was 4.6% higher than that 
of the best element predictor. We have also provided a 
web service for the prediction of rice protein phosphorylation 
sites, which can be accessed at 
http://bioinformatics.fafu.edu.cn/PhosphoRice. 

Methods 

Preprocessing of dataset 

We collected rice phosphorylation sites from recent 
literature, including Nakagami et al. (2010), and from the 
feature table of the Swiss-Prot database. After removing 
redundant phosphorylation sites, the numbers of serine 
(S), threonine (T) and tyrosine (Y) sites were 4220, 
605 and 141, respectively (Table 6). These phosphorylation 
sites were located in 2162 proteins (Additional file 
1). The 25-mer sequences (-12 to +12) surrounding the 
phosphorylation sites were extracted from the protein sequences 
to construct the dataset. Because all of the phosphorylation 
sites in the positive dataset were experimentally 
verified, they were regarded as (+) sites. The Ser, Thr 
and Tyr residues that were not annotated as phosphorylation 
sites within the dataset were regarded as (-) sites 
(i.e., non-phosphorylation sites). We balanced the positive 
and negative datasets so that their sizes 
were equal during the cross-validation 
process (Table 6). 
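
The window extraction described above can be illustrated with a minimal sketch. The function name `extract_windows` and the use of "X" to pad windows that run past the protein termini are assumptions made for illustration; the handling of sites near the termini is not stated in this paper.

```python
from typing import Iterator, Tuple

FLANK = 12  # residues on each side of the candidate site, giving a 25-mer


def extract_windows(sequence: str, flank: int = FLANK,
                    pad: str = "X") -> Iterator[Tuple[int, str]]:
    """Yield (1-based position, window) for every S, T or Y residue.

    Padding with `pad` at the termini is an illustrative assumption.
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue in "STY":
            # Residue i of the original sequence sits at i + flank in `padded`,
            # so this slice is centred on it.
            yield i + 1, padded[i:i + 2 * flank + 1]


for position, window in extract_windows("MKSAT"):
    print(position, window)
```

Each window would then be labelled (+) if the central residue is an annotated phosphorylation site and (-) otherwise.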

We used a standard 10-fold cross validation to opti- 
mize the weight of all the individual predictors, and cal- 
culated the ACC and MCC of each meta predictor. The 
dataset was randomly partitioned into 10 subsets, 
including one testing subset and nine training subsets. 



Table 6 Number of phosphoserine, phosphothreonine 
and phosphotyrosine sites in positive and negative 
dataset 

                    Number of phosphorylation sites
Dataset             Serine   Threonine   Tyrosine   Total
Positive dataset    4220     605         141        4966
Negative dataset    2954     1798        834        5586
The weights were updated and the ACC and MCC were 
recalculated. The new weights were kept only if the 
ACC and MCC increased; otherwise the weights were 
rolled back to their previous values. Using this strategy, 
the meta-predictors were trained by shifting the test 
subset stepwise so that, on completion, all data had been 
used for both training and testing. 
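
As a rough illustration of the 10-fold partitioning, the following sketch splits a set of item indices into ten folds and rotates the test fold. The function name `ten_fold_splits` and the fixed random seed are hypothetical; this is not the Perl code used in the study.

```python
import random
from typing import List, Tuple


def ten_fold_splits(n_items: int, k: int = 10,
                    seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering all items."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k near-equal subsets
    splits = []
    for test_idx in range(k):
        test = folds[test_idx]                    # one testing subset
        train = [i for j, fold in enumerate(folds)
                 if j != test_idx for i in fold]  # the other nine subsets
        splits.append((train, test))
    return splits


splits = ten_fold_splits(103)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Shifting the test fold through all ten positions means every site is used once for testing and nine times for training.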

Selection of element predictors 

Six phosphorylation site prediction programs, NetPhosK, 
NetPhos2.0, KinasePhos, PredPhospho 1.0, Scansite and 
DISPHOS, were selected as element prediction 
programs. NetPhosK, KinasePhos, PredPhospho 1.0 and 
Scansite are kinase-family-specific phosphorylation site 
predictors, while NetPhos2.0 and DISPHOS are not. All 
of the element predictors were run under no 
kinase-specific condition. Their prediction performance was 
evaluated in our previous work. Fifteen element 
predictors derived from these programs were used to 
form rice-specific meta-predictors of phosphorylation 
sites (Additional file 2). The methods for obtaining 
these 15 element predictors are described below. 

NetPhos and NetPhosK (http://www.cbs.dtu.dk/services/NetPhosK/) 
use an artificial neural network 
algorithm to predict phosphorylation sites. With the 
NetPhosK prediction server, the option "prediction 
without filtering" was selected to predict phosphorylation 
sites. The threshold value was set to 0.5 or 0.7 to 
determine whether or not a site is predicted as 
phosphorylated. The result at each threshold value was 
used as an element predictor; these were named 
NetPhosK_0.5 and NetPhosK_0.7. 

DISPHOS (DISorder-enhanced PHOSphorylation site 
predictor, http://core.ist.temple.edu/pred/) uses 
position-specific amino acid composition and predicted structural 
disorder information to distinguish phosphorylation sites from 
non-phosphorylation sites. In this study, the "default 
predictor," "Eukaryotes" or "A. thaliana" option was chosen to predict 
phosphorylation sites in rice, and the resulting element predictors 
were named Disphos_default, Disphos_Eukaryotes and Disphos_Arabidopsis, 
respectively. 

KinasePhos (http://kinasephos.mbc.nctu.edu.tw/index.php) 
employs a profile hidden Markov model (HMM) 
to predict kinase family-specific phosphorylation sites. 
In this study, KinasePhos was run with the options of 
90%, 95% and 100% prediction specificity and 'by default 
HMM bit score', whilst KinasePhos 2.0 was run with 80% 
prediction specificity. These five selections 
resulted in five separate element predictors termed 
KinasePhos_90, KinasePhos_95, KinasePhos_100, 
KinasePhos_default and KinasePhos2.0_80. 

Scansite (http://scansite.mit.edu/) uses scores 
calculated from position-specific score matrices (PSSM) to 
search for motifs within proteins that are likely to be 
phosphorylated by specific protein kinases. In this work, 
the high, medium or low stringency level 
was selected, producing three 
separate element predictors named Scansite_high, 
Scansite_middle and Scansite_low, respectively. 

PredPhospho (http://pred.ngri.re.kr/PredPhospho.htm) 
predicts various kinase-specific phosphorylation sites by 
training SVMs. In this study, the prediction was made 
by considering all kinase groups and families. 

Prediction and performance measures 

It was difficult to compare the numerical scores pro- 
duced by the individual element predictors due to their 
differences in mathematical meaning [32]. In this study, 
the value of the scores was ignored, and instead a binary 
value was assigned (representing phosphorylated or not 
phosphorylated) and then performance was compared 
across prediction programs. 

Four measurements, Sensitivity (Sn), Specificity (Sp), 
Accuracy (ACC) and the Matthew's Correlation 
Coefficient (MCC), were employed to evaluate the 
performance of the tested predictors (definitions below): 

$$\mathrm{Sn} = \frac{TP}{TP + FN},$$

$$\mathrm{Sp} = \frac{TN}{TN + FP},$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN},$$

and 

$$\mathrm{MCC} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}},$$

where TP, FP, FN, and TN denote true positives, false 
positives, false negatives, and true negatives. Sn and Sp 
illustrate the correct prediction ratios of positive and 
negative datasets, respectively. Because MCC is much 
less susceptible to the ratio of positive samples and 
negative samples in the dataset, it is the most widely 
used prediction measure for two-class prediction pro- 
grams [32]. 
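
To make the four measures concrete, the sketch below computes them from confusion-matrix counts. The function name `performance` and the convention of returning an MCC of 0 when the denominator is zero are illustrative assumptions, not part of the original analysis scripts.

```python
import math
from typing import Dict


def performance(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # fraction of (+) sites recovered
    sp = tn / (tn + fp)                      # fraction of (-) sites rejected
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumption: report 0.0 when MCC is undefined (zero denominator).
    mcc = (tp * tn - fn * fp) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


print(performance(tp=40, fp=10, tn=45, fn=5))
```

Unlike ACC, the MCC uses all four counts in both numerator and denominator, which is why it is less sensitive to the ratio of positive to negative samples.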

We used SPSS 16.0 to create receiver operating characteristic 
(ROC) curves to measure the performance of the 
meta-predictors. For each possible threshold, the sensitivity and 
specificity were evaluated, the ROC curve [sensitivity 
versus (1 - specificity)] was plotted, and the area 
under this curve was calculated. In this study, 
ROC curves were used to compare the prediction 
performance of each meta-predictor with that of the best element 
predictor, Disphos_default. The area under 
the ROC curve was also calculated to compare the 
prediction performance of PhosphoRice with that of Musite, 
a recently published predictor. 
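
The ROC procedure described above was carried out in SPSS; the following is only a stand-in sketch of the same idea. It sweeps every distinct score as a threshold, collects (1 - specificity, sensitivity) points, and integrates them with the trapezoidal rule. The function name `roc_auc` and the toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve; labels are 1 for (+) and 0 for (-) sites."""
    pos = sum(labels)
    neg = len(labels) - pos
    points: List[Tuple[float, float]] = [(0.0, 0.0)]
    # A site is called positive when its score is >= the threshold.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))   # (1 - specificity, sensitivity)
    # Trapezoidal integration over consecutive ROC points.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

For a voting meta-predictor, the weighted vote total of each site can serve as its score.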
Unweighted voting, unreduced weighted voting and 
reduced weighted voting strategies 

The unweighted voting, unreduced weighted voting and 
reduced weighted voting strategies were used to 
construct meta-predictors according to the procedure 
outlined by Liu et al. (2007) and Wan et al. (2008). 
Generally, a linear voting-based two-class classifier makes a 
positive prediction if the following condition is 
satisfied: 



$$\sum_{j=1}^{N} w_j P_j \geq T \qquad (1)$$



where N is the total number of element predictors (in 
this experiment, N = 15), w_j is the weight of the jth 
prediction method (w_j = 1 for all element predictors in 
the unweighted voting strategy), P_j is the prediction 
made by the jth predictor (P_j = 1 for a positive prediction 
and P_j = 0 otherwise) and T is the threshold score. 

For a simple weighted voting strategy, the threshold 
T can be set to half of the total weight of the 
predictors: 



$$T = \frac{1}{2} \sum_{j=1}^{N} w_j \qquad (2)$$
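
The voting rule can be sketched as follows, with the binary predictions of the element predictors combined into one call. The function name `vote` is hypothetical, and treating a score exactly equal to T as positive is an assumption; when no threshold is given, T defaults to half of the total weight.

```python
from typing import Optional, Sequence


def vote(predictions: Sequence[int],
         weights: Optional[Sequence[float]] = None,
         threshold: Optional[float] = None) -> int:
    """Return 1 (phosphorylated) or 0 for one candidate site."""
    if weights is None:
        weights = [1.0] * len(predictions)    # unweighted voting: w_j = 1
    if threshold is None:
        threshold = sum(weights) / 2          # half of the total weight
    score = sum(w * p for w, p in zip(weights, predictions))
    # Assumption: the comparison is inclusive (score >= T).
    return 1 if score >= threshold else 0


print(vote([1, 1, 0]))                        # two of three vote positive
print(vote([1, 0, 0], weights=[3, 1, 1]))     # one heavily weighted vote
```

Setting a weight to 0 removes that element predictor from the vote, which is how a weighted scheme can also act as predictor selection.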



Restricted grid search 

In Equation (1), proper weight parameters (w_j) 
produce a classifier with good prediction performance. 
In this study, there were 16 parameters to be determined 
for the best-performing classifier: the 15 weights w_j and 
the threshold T. We 
applied the restricted grid search method, which has been widely 
used in two-class classification problems [32,33], to select the 
values of these 16 parameters. There 
were two critical restrictions in our 
study. First, we limited the weight of each element 
predictor to one of the following values: 0, 1, 3, 5, 7, 9, 
11, 13 and 15. Second, the sum of the weights of all 15 
element predictors had to equal 15 (Table 7). The 
restricted grid search of the 16 parameters was 
conducted on the dataset with 10-fold cross-validation. 
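
The two restrictions define a finite search space that can be enumerated directly. The sketch below lists every multiset of non-zero allowed weights summing to 15 and counts how many ways each can be assigned to the 15 predictors. The function names are hypothetical, and the counting uses binomial coefficients, which is an assumption about how the permutation counts were derived.

```python
from collections import Counter
from math import comb
from typing import Iterator, Tuple

ALLOWED = (1, 3, 5, 7, 9, 11, 13, 15)   # non-zero grid values
N_PREDICTORS = 15
TOTAL = 15


def weight_combinations(total: int = TOTAL,
                        max_parts: int = N_PREDICTORS) -> Iterator[Tuple[int, ...]]:
    """Yield non-increasing tuples of allowed weights that sum to `total`."""
    def rec(remaining: int, max_value: int, parts_left: int):
        if remaining == 0:
            yield ()
            return
        if parts_left == 0:
            return
        for v in ALLOWED:
            if v <= max_value and v <= remaining:
                for rest in rec(remaining - v, v, parts_left - 1):
                    yield (v,) + rest
    yield from rec(total, max(ALLOWED), max_parts)


def n_permutations(combo: Tuple[int, ...], n: int = N_PREDICTORS) -> int:
    """Ways to assign the weights in `combo` to n predictors (rest get 0)."""
    count, free = 1, n
    for k in Counter(combo).values():
        count *= comb(free, k)   # choose which predictors take this value
        free -= k
    return count


combos = list(weight_combinations())
print(len(combos), sum(n_permutations(c) for c in combos))
```

Each enumerated assignment would then be paired with candidate values of T and scored by cross-validation.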

Conditional random search 

Conditional random fields were first introduced by Laff- 
erty and colleagues in 2001 [35]. For the conditional 
random search, the threshold T was set as a random 
value of the total weight of the predictors. 



$$T = \mathrm{rand}\left(\sum_{j=1}^{N} w_j\right) \qquad (3)$$



Table 7 Weight combinations, permutations and possible 
weights sum values in the restricted grid search scheme 

Weight combinations*          Number of corresponding weight permutations**
15 x (1)                      $P_{15}^{1} = 15$
1 x (2)+13 x (1)              $P_{15}^{2} \times P_{13}^{1} = 1365$
1 x (1)+3 x (1)+11 x (1)      $P_{15}^{1} \times P_{14}^{1} \times P_{13}^{1} = 2730$
1 x (4)+11 x (1)              $P_{15}^{4} \times P_{11}^{1} = 15015$
1 x (1)+5 x (1)+9 x (1)       $P_{15}^{1} \times P_{14}^{1} \times P_{13}^{1} = 2730$
3 x (2)+9 x (1)               $P_{15}^{2} \times P_{13}^{1} = 1365$
1 x (3)+3 x (1)+9 x (1)       $P_{15}^{3} \times P_{12}^{1} \times P_{11}^{1} = 60060$
1 x (6)+9 x (1)               $P_{15}^{6} \times P_{9}^{1} = 45045$
1 x (1)+7 x (2)               $P_{15}^{1} \times P_{14}^{2} = 1365$
3 x (1)+5 x (1)+7 x (1)       $P_{15}^{1} \times P_{14}^{1} \times P_{13}^{1} = 2730$
1 x (3)+5 x (1)+7 x (1)       $P_{15}^{3} \times P_{12}^{1} \times P_{11}^{1} = 60060$
1 x (2)+3 x (2)+7 x (1)       $P_{15}^{2} \times P_{13}^{2} \times P_{11}^{1} = 90090$
1 x (5)+3 x (1)+7 x (1)       $P_{15}^{5} \times P_{10}^{1} \times P_{9}^{1} = 270270$
1 x (8)+7 x (1)               $P_{15}^{8} \times P_{7}^{1} = 450450$
5 x (3)                       $P_{15}^{3} = 455$
1 x (2)+3 x (1)+5 x (2)       $P_{15}^{2} \times P_{13}^{1} \times P_{12}^{2} = 90090$
1 x (5)+5 x (2)               $P_{15}^{5} \times P_{10}^{2} = 135135$
1 x (1)+3 x (3)+5 x (1)       $P_{15}^{1} \times P_{14}^{3} \times P_{11}^{1} = 60060$
1 x (4)+3 x (2)+5 x (1)       $P_{15}^{4} \times P_{11}^{2} \times P_{9}^{1} = 675675$
1 x (7)+3 x (1)+5 x (1)       $P_{15}^{7} \times P_{8}^{1} \times P_{7}^{1} = 360360$
1 x (10)+5 x (1)              $P_{15}^{10} \times P_{5}^{1} = 15015$
3 x (5)                       $P_{15}^{5} = 3003$
1 x (3)+3 x (4)               $P_{15}^{3} \times P_{12}^{4} = 225225$
1 x (6)+3 x (3)               $P_{15}^{6} \times P_{9}^{3} = 420420$
1 x (9)+3 x (2)               $P_{15}^{9} \times P_{6}^{2} = 75075$
1 x (12)+3 x (1)              $P_{15}^{12} \times P_{3}^{1} = 1365$
1 x (15)                      $P_{15}^{15} = 1$

Possible weighted sum         0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
values ($\sum_{j=1}^{N} w_j P_j$)

* Weight combinations are denoted as the sum of each weight value 
multiplied by the number of weights taking the weight value, with the weight 
value = 0 omitted. 

** For instance, "15 x (1)" represents that 1 of the 15 weights takes the value 
15, and the other 14 weights take the value 0; and "1 x (1)+3 x (1)+11 x (1)" 
represents that 1 of the 15 weights takes the value 1, 1 weight takes the 
value 3, 1 weight takes 11 and the remaining 12 weights take the value 0. 
Each weight combination corresponds to one or more weight permutations. 
For instance, for weight combination "15 x (1)," the weight value 15 can be 
taken by each of the 15 weights; thus, it corresponds to $P_{15}^{1}$ weight 
permutations. 



Randomized algorithms are often simple, elegant and 
efficient for selecting parameters. They rely on a series 
of unrelated and unpredictable digits or characters. 
However, a computer cannot produce a truly 
random number; it can only produce a "pseudorandom 
number". The conditional random search method can 
be summarized as follows: 

a. take the weights selected by the restricted grid search; 

b. set the random search range of each parameter between the 
previous grid value and the next grid value of the parameter 
selected by the restricted grid search; 

c. run the random search program; 

d. train on the training set and test on the test set; 

e. stop at a parameter combination that 
achieves a higher MCC than that of the restricted grid 
search. 
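
Steps a to e can be sketched as a short loop. Everything named here is hypothetical: `evaluate` stands for a user-supplied function that returns the cross-validated MCC for a given set of weights and threshold, and `max_trials`, the fixed seed and the uniform sampling are assumptions. The threshold is drawn between 0 and the total weight, following Equation (3).

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

GRID = [0, 1, 3, 5, 7, 9, 11, 13, 15]   # grid values of the restricted search


def conditional_random_search(
    grid_weights: Sequence[int],
    evaluate: Callable[[List[float], float], float],
    baseline_mcc: float,
    grid_values: Sequence[int] = GRID,
    max_trials: int = 10000,
    seed: int = 0,
) -> Optional[Tuple[List[float], float, float]]:
    rng = random.Random(seed)
    # Steps a and b: each weight may move between its neighbouring grid values.
    ranges = []
    for w in grid_weights:
        i = grid_values.index(w)
        low = grid_values[max(i - 1, 0)]
        high = grid_values[min(i + 1, len(grid_values) - 1)]
        ranges.append((low, high))
    # Steps c to e: sample, evaluate, stop once the grid-search MCC is beaten.
    for _ in range(max_trials):
        weights = [rng.uniform(lo, hi) for lo, hi in ranges]
        threshold = rng.uniform(0, sum(weights))
        mcc = evaluate(weights, threshold)
        if mcc > baseline_mcc:
            return weights, threshold, mcc
    return None  # no improvement found within max_trials


# Toy evaluator for demonstration only; a real one would run cross-validation.
print(conditional_random_search([0, 1, 3], lambda w, t: 0.5, baseline_mcc=0.4))
```

With the grid above, a grid weight of 0 is sampled from 0 to 1, a weight of 1 from 0 to 3, and a weight of 3 from 1 to 5, matching the Random (1), Random (3) and 1+Random (4) entries in Table 3.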



Additional material 



Additional file 1: Rice phosphorylation sites data. Data file listing 
Accession Number, full-length sequence, phosphorylated amino acid and 
its site position. 

Additional file 2: Summary of the 15 element predictors. Summary 
file listing the name, references and URLs of the 15 element predictors 
used to produce meta-predictors. 